Abstract


We introduce Group equivariant Convolutional 
Neural Networks (G-CNNs), a natural general- 
ization of convolutional neural networks that re- 
duces sample complexity by exploiting symme- 
tries. G-CNNs use G-convolutions, a new type of 
layer that enjoys a substantially higher degree of 
weight sharing than regular convolution layers. 
G-convolutions increase the expressive capacity 
of the network without increasing the number of 
parameters. Group convolution layers are easy 
to use and can be implemented with negligible 
computational overhead for discrete groups gen- 
erated by translations, reflections and rotations. 
G-CNNs achieve state of the art results on CIFAR10
and rotated MNIST.


1. Introduction 


Deep convolutional neural networks (CNNs, convnets) 
have proven to be very powerful models of sensory data 
such as images, video, and audio. Although a strong the- 
ory of neural network design is currently lacking, a large 
amount of empirical evidence supports the notion that both 
convolutional weight sharing and depth (among other fac- 
tors) are important for good predictive performance. 


Convolutional weight sharing is effective because there is 
a translation symmetry in most perception tasks: the la- 
bel function and data distribution are both approximately 
invariant to shifts. By using the same weights to analyze 
or model each part of the image, a convolution layer uses 
far fewer parameters than a fully connected one, while pre- 
serving the capacity to learn many useful transformations. 


Convolution layers can be used effectively in a deep network
because all the layers in such a network are translation
equivariant: shifting the image and then feeding
it through a number of layers is the same as feeding the 
original image through the same layers and then shifting 
the resulting feature maps (at least up to edge-effects). In 
other words, the symmetry (translation) is preserved by 
each layer, which makes it possible to exploit it not just 
in the first, but also in higher layers of the network. 


In this paper we show how convolutional networks can be 
generalized to exploit larger groups of symmetries, includ- 
ing rotations and reflections. The notion of equivariance is 
key to this generalization, so in section 2 we will discuss 
this concept and its role in deep representation learning. 
After discussing related work in section 3, we recall a num- 
ber of mathematical concepts in section 4 that allow us to 
define and analyze the G-convolution in a generic manner. 


In section 5, we analyze the equivariance properties of stan- 
dard CNNs, and show that they are equivariant to trans- 
lations but may fail to equivary with more general trans- 
formations. Using the mathematical framework from sec- 
tion 4, we can define G-CNNs (section 6) by analogy to 
standard CNNs (the latter being the G-CNN for the transla- 
tion group). We show that G-convolutions, as well as var- 
ious kinds of layers used in modern CNNs, such as pool- 
ing, arbitrary pointwise nonlinearities, batch normalization 
and residual blocks are all equivariant, and thus compatible 
with G-CNNs. In section 7 we provide concrete implemen- 
tation details for group convolutions. 


In section 8 we report experimental results on MNIST-rot 
and CIFAR10, where G-CNNs achieve state of the art re- 
sults (2.28% error on MNIST-rot, and 4.19% resp. 6.46% 
on augmented and plain CIFAR10). We show that replac- 
ing planar convolutions with G-convolutions consistently 
improves results without additional tuning. In section 9 we 
provide a discussion of these results and consider several 
extensions of the method, before concluding in section 10. 


2. Structured & Equivariant Representations


Deep neural networks produce a sequence of progressively 
more abstract representations by mapping the input through 
a series of parameterized functions (LeCun et al., 2015). In 
the current generation of neural networks, the representa- 
tion spaces are usually endowed with very minimal internal 
structure, such as that of a linear space $\mathbb{R}^n$.


In this paper we construct representations that have the 
structure of a linear G-space, for some chosen group G. 
This means that each vector in the representation space has 
a pose associated with it, which can be transformed by the 
elements of some group of transformations G. This addi- 
tional structure allows us to model data more efficiently: A 
filter in a G-CNN detects co-occurrences of features that 
have the preferred relative pose, and can match such a fea- 
ture constellation in every global pose through an operation 
called the G-convolution. 


A representation space can obtain its structure from other 
representation spaces to which it is connected. For this to 
work, the network or layer $\Phi$ that maps one representation
to another should be structure preserving. For G-spaces
this means that $\Phi$ has to be equivariant:


$$\Phi(T_g x) = T'_g \Phi(x), \tag{1}$$


That is, transforming an input x by a transformation g
(forming $T_g x$) and then passing it through the learned map
$\Phi$ should give the same result as first mapping x through $\Phi$
and then transforming the representation.


Equivariance can be realized in many ways, and in particular
the operators T and T' need not be the same. The only
requirement for T and T' is that for any two transformations
g and h, we have T(gh) = T(g)T(h) (i.e. T is a
linear representation of G).


From equation 1 we see that the familiar concept of invariance
is a special kind of equivariance where $T'_g$ is the
identity transformation for all g. In deep learning, general
equivariance is more useful than invariance because it is 
impossible to determine if features are in the right spatial 
configuration if they are invariant. 


Besides improving statistical efficiency and facilitating geometrical
reasoning, equivariance to symmetry transformations
constrains the network in a way that can aid generalization.
A network $\Phi$ can be non-injective, meaning that
non-identical vectors x and y in the input space become
identical in the output space (for example, two instances
of a face may be mapped onto a single vector indicating
the presence of any face). If $\Phi$ is equivariant, then the G-transformed
inputs $T_g x$ and $T_g y$ must also be mapped to
the same output. Their “sameness” (as judged by the network)
is preserved under symmetry transformations.


3. Related Work 


There is a large body of literature on invariant representa- 
tions. Invariance can be achieved by pose normalization 
using an equivariant detector (Lowe, 2004; Jaderberg et al., 
2015) or by averaging a possibly nonlinear function over 
a group (Reisert, 2008; Skibbe, 2013; Manay et al., 2006; 
Kondor, 2007). 


Scattering convolution networks use wavelet convolutions, 
nonlinearities and group averaging to produce stable in- 
variants (Bruna & Mallat, 2013). Scattering networks have 
been extended to use convolutions on the group of transla- 
tions, rotations and scalings, and have been applied to ob- 
ject and texture recognition (Sifre & Mallat, 2013; Oyallon 
& Mallat, 2015). 


A number of recent works have addressed the problem 
of learning or constructing equivariant representations. 
This includes work on transforming autoencoders (Hin- 
ton et al., 2011), equivariant Boltzmann machines (Kivi- 
nen & Williams, 2011; Sohn & Lee, 2012), equivariant de- 
scriptors (Schmidt & Roth, 2012), and equivariant filtering 
(Skibbe, 2013). 


Lenc & Vedaldi (2015) show that the AlexNet CNN 
(Krizhevsky et al., 2012) trained on ImageNet spontaneously
learns representations that are equivariant to flips,
scaling and rotation. This supports the idea that equivari- 
ance is a good inductive bias for deep convolutional net- 
works. Agrawal et al. (2015) show that useful representa- 
tions can be learned in an unsupervised manner by training 
a convolutional network to be equivariant to ego-motion. 


Anselmi et al. (2014; 2015) use the theory of locally com- 
pact topological groups to develop a theory of statistically 
efficient learning in sensory cortex. This theory was imple- 
mented for the commutative group consisting of time- and 
vocal tract length shifts for an application to speech recog- 
nition by Zhang et al. (2015). 


Gens & Domingos (2014) proposed an approximately 
equivariant convolutional architecture that uses sparse, 
high-dimensional feature maps to deal with high- 
dimensional groups of transformations. Dieleman et al. 
(2015) showed that rotation symmetry can be exploited in 
convolutional networks for the problem of galaxy morphol- 
ogy prediction by rotating feature maps, effectively learn- 
ing an equivariant representation. This work was later ex- 
tended (Dieleman et al., 2016) and evaluated on various 
computer vision problems that have cyclic symmetry. 


Cohen & Welling (2014) showed that the concept of disen- 
tangling can be understood as a reduction of the operators 
$T_g$ in an equivariant representation, and later related this
notion of disentangling to the more familiar statistical no- 
tion of decorrelation (Cohen & Welling, 2015). 


4. Mathematical Framework


In this section we present a mathematical framework that 
enables a simple and generic definition and analysis of G- 
CNNs for various groups G. We begin by defining sym- 
metry groups, and study in particular two groups that are 
used in the G-CNNs we have built so far. Then we take a 
look at functions on groups (used to model feature maps in 
G-CNNs) and their transformation properties. 


4.1. Symmetry Groups 


A symmetry of an object is a transformation that leaves 
the object invariant. For example, if we take the sampling
grid of our image, $\mathbb{Z}^2$, and flip it over we get
$-\mathbb{Z}^2 = \{(-n, -m) \mid (n, m) \in \mathbb{Z}^2\} = \mathbb{Z}^2$.
So the flipping operation is a symmetry of the sampling grid.


If we have two symmetry transformations g and h and we 
compose them, the result gh is another symmetry transformation
(i.e. it leaves the object invariant as well). Furthermore,
the inverse $g^{-1}$ of any symmetry is also a symmetry,
and composing it with g gives the identity transformation 
e. A set of transformations with these properties is called a 
symmetry group. 


One simple example of a group is the set of 2D integer
translations, $\mathbb{Z}^2$. Here the group operation (“composition
of transformations”) is addition: (n, m) + (p, q) = (n + p, m + q).
One can verify that the sum of two translations
is again a translation, and that the inverse (negative) of a
translation is a translation, so this is indeed a group.


Although it may seem fancy to call 2-tuples of integers a 
group, this is helpful in our case because as we will see in 
section 6, a useful notion of convolution can be defined for 
functions on any group¹, of which $\mathbb{Z}^2$ is only one example.
The important properties of the convolution, such as
equivariance, arise primarily from the group structure. 


4.2. The group p4 


The group p4 consists of all compositions of translations 
and rotations by 90 degrees about any center of rotation in 
a square grid. A convenient parameterization of this group 
in terms of three integers r, u, v is 


$$g(r, u, v) = \begin{bmatrix} \cos(r\pi/2) & -\sin(r\pi/2) & u \\ \sin(r\pi/2) & \cos(r\pi/2) & v \\ 0 & 0 & 1 \end{bmatrix}, \tag{2}$$


where $0 \le r < 4$ and $(u, v) \in \mathbb{Z}^2$. The group operation is
given by matrix multiplication.


The composition and inversion operations could also be
represented directly in terms of integers (r, u, v), but the
equations are cumbersome. Hence, our preferred method
of composing two group elements represented by integer
tuples is to convert them to matrices, multiply these matrices,
and then convert the resulting matrix back to a tuple of
integers (using the atan2 function to obtain r).


¹ At least, on any locally compact group.


The group p4 acts on points in $\mathbb{Z}^2$ (pixel coordinates) by
multiplying the matrix g(r, u, v) by the homogeneous coordinate
vector x(u', v') of a point (u', v'):


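$$gx \simeq \begin{bmatrix} \cos(r\pi/2) & -\sin(r\pi/2) & u \\ \sin(r\pi/2) & \cos(r\pi/2) & v \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} u' \\ v' \\ 1 \end{bmatrix} \tag{3}$$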
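

The matrix-based composition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`p4_matrix`, `matmul3`, `p4_tuple`, `p4_compose`, `p4_act`) are hypothetical and not taken from any released code, and matrices are stored as nested lists of integers so that all arithmetic stays exact.

```python
import math

def p4_matrix(r, u, v):
    """Homogeneous 3x3 matrix of the p4 element (r, u, v), following eq. 2."""
    c = round(math.cos(r * math.pi / 2))  # exactly 0, 1 or -1 after rounding
    s = round(math.sin(r * math.pi / 2))
    return [[c, -s, u], [s, c, v], [0, 0, 1]]

def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def p4_tuple(g):
    """Convert a p4 matrix back to (r, u, v); atan2(sin, cos) recovers r."""
    r = round(math.atan2(g[1][0], g[0][0]) / (math.pi / 2)) % 4
    return (r, g[0][2], g[1][2])

def p4_compose(a, b):
    """Tuple of the product a*b (b is applied first, then a)."""
    return p4_tuple(matmul3(p4_matrix(*a), p4_matrix(*b)))

def p4_act(a, point):
    """Apply the p4 element a to a pixel coordinate (u', v'), as in eq. 3."""
    g = p4_matrix(*a)
    x = (point[0], point[1], 1)  # homogeneous coordinates
    return tuple(sum(g[i][k] * x[k] for k in range(3)) for i in range(2))
```

With these helpers, `p4_compose((1, 0, 0), (0, 1, 0))` and `p4_compose((0, 1, 0), (1, 0, 0))` give different tuples, which illustrates that p4 is not commutative.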


4.3. The group p4m


The group p4m consists of all compositions of translations, 
mirror reflections, and rotations by 90 degrees about any 
center of rotation in the grid. Like p4, we can parameterize 
this group by integers: 


$$g(m, r, u, v) = \begin{bmatrix} (-1)^m \cos(r\pi/2) & -(-1)^m \sin(r\pi/2) & u \\ \sin(r\pi/2) & \cos(r\pi/2) & v \\ 0 & 0 & 1 \end{bmatrix},$$


where $m \in \{0, 1\}$, $0 \le r < 4$ and $(u, v) \in \mathbb{Z}^2$. The reader
may verify that this is indeed a group.


Again, composition is most easily performed using the matrix
representation. Computing r, u, v from a given matrix
g can be done using the same method we use for p4, and
for m we have $m = \frac{1}{2}(1 - \det(g))$.
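

A minimal sketch of the same round trip for p4m follows. The function names are hypothetical, and one detail is an assumption of this sketch rather than something stated in the text: the angle is read from the second matrix row (sin, cos), because the mirror factor only changes the sign of the first row.

```python
import math

def p4m_matrix(m, r, u, v):
    """Homogeneous 3x3 matrix of the p4m element (m, r, u, v)."""
    c = round(math.cos(r * math.pi / 2))
    s = round(math.sin(r * math.pi / 2))
    sign = (-1) ** m  # the mirror flips the sign of the first row only
    return [[sign * c, -sign * s, u], [s, c, v], [0, 0, 1]]

def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def p4m_tuple(g):
    """Convert a p4m matrix back to (m, r, u, v)."""
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]  # +1 without mirror, -1 with
    m = (1 - det) // 2
    # Assumption of this sketch: use the second row, which the mirror leaves untouched.
    r = round(math.atan2(g[1][0], g[1][1]) / (math.pi / 2)) % 4
    return (m, r, g[0][2], g[1][2])

def p4m_compose(a, b):
    """Tuple of the product a*b (b is applied first, then a)."""
    return p4m_tuple(matmul3(p4m_matrix(*a), p4m_matrix(*b)))
```

Composing the mirror with itself returns the identity tuple (0, 0, 0, 0), in line with reflections being self-inverse.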


4.4. Functions on groups 


We model images and stacks of feature maps in a conventional
CNN as functions $f : \mathbb{Z}^2 \to \mathbb{R}^K$ supported on a
bounded (typically rectangular) domain. At each pixel coordinate
$(p, q) \in \mathbb{Z}^2$, the stack of feature maps returns a
K-dimensional vector f(p, q), where K denotes the number
of channels.


Although the feature maps must always be stored in finite 
arrays, modeling them as functions that extend to infinity 
(while being non-zero on a finite region only) simplifies 
the mathematical analysis of CNNs. 


We will be concerned with transformations of the feature 
maps, so we introduce the following notation for a trans- 
formation g acting on a set of feature maps: 


$$[L_g f](x) = [f \circ g^{-1}](x) = f(g^{-1} x) \tag{4}$$


Computationally, this says that to get the value of the g-transformed
feature map $L_g f$ at the point x, we need to do
a lookup in the original feature map f at the point $g^{-1} x$,
which is the unique point that gets mapped to x by g. This
operator $L_g$ is a concrete instantiation of the transformation
operator $T_g$ referenced in section 2, and one may verify that


$$L_g L_h = L_{gh}. \tag{5}$$


If g represents a pure translation $t = (u, v) \in \mathbb{Z}^2$ then
$g^{-1} x$ simply means $x - t$. The inverse on g in equation 4
ensures that the function is shifted in the positive direction
when using a positive translation, and that $L_g$ satisfies the
criterion for being a homomorphism (eq. 5) even for transformations
g and h that do not commute (i.e. $gh \neq hg$).
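

The homomorphism property can be checked on a toy example. The sketch below is an illustration only: it assumes a feature map with finite support is stored as a Python dict from grid points to values (zero elsewhere), uses p4 tuples (r, u, v) for group elements, and all function names are hypothetical.

```python
def act(g, p):
    """Apply the p4 element g = (r, u, v) to a grid point p."""
    r, u, v = g
    x, y = p
    for _ in range(r % 4):
        x, y = -y, x  # one 90-degree rotation about the origin
    return (x + u, y + v)

def compose(g, h):
    """Element gh, i.e. apply h first and then g."""
    u, v = act(g, (h[1], h[2]))  # translation part of gh
    return ((g[0] + h[0]) % 4, u, v)

def L(g, f):
    """[L_g f](x) = f(g^{-1} x): the value stored at y ends up at g y."""
    return {act(g, y): value for y, value in f.items()}

f = {(0, 0): 1.0, (1, 0): 2.0, (2, 1): 3.0}  # zero everywhere else
g, h = (1, 2, 0), (3, -1, 4)                 # two elements that do not commute

same = L(g, L(h, f)) == L(compose(g, h), f)  # L_g L_h = L_{gh}
```

Storing only the non-zero entries mirrors the modeling choice made earlier in this section, where feature maps extend to infinity but are non-zero on a finite region.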


As will be explained in section 6.1, feature maps in a G-CNN
are functions on the group G, instead of functions on
the group $\mathbb{Z}^2$. For functions on G, the definition of $L_g$ is
still valid if we simply replace x (an element of $\mathbb{Z}^2$) by h
(an element of G), and interpret $g^{-1} h$ as composition.


It is easy to mentally visualize a planar feature map
$f : \mathbb{Z}^2 \to \mathbb{R}$ undergoing a transformation, but we are not used
to visualizing functions on groups. To visualize a feature 
map or filter on p4, we plot the four patches associated with 
the four pure rotations on a circle, as shown in figure 1 
(left). Each pixel in this figure has a rotation coordinate 
(the patch in which the pixel appears), and two translation 
coordinates (the pixel position within the patch). 




Figure 1. A p4 feature map and its rotation by r. 


When we apply the 90 degree rotation r to a function on 
p4, each planar patch follows its red r-arrow (thus incre- 
menting the rotation coordinate by 1 (mod 4)), and simul- 
taneously undergoes a 90-degree rotation. The result of this 
operation is shown on the right of figure 1. As we will see 
in section 6, a p4 feature map in a p4-CNN undergoes ex- 
actly this motion under rotation of the input image. 
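

In array terms, this motion can be sketched with NumPy as follows. The layout is an assumption made for illustration: a p4 feature map is stored with shape (4, n, n), the first axis being the rotation coordinate, and `np.rot90` is taken as the positive quarter turn. The name `rotate_p4` is hypothetical.

```python
import numpy as np

def rotate_p4(f, k=1):
    """Rotate a p4 feature map of shape (..., 4, n, n) by k quarter turns.

    Every patch is rotated in the plane and moved k steps along the
    rotation axis, matching the description of figure 1.
    """
    planar = np.rot90(f, k, axes=(-2, -1))  # rotate each planar patch
    return np.roll(planar, k, axis=-3)      # increment the rotation coordinate (mod 4)

f = np.arange(4 * 3 * 3).reshape(4, 3, 3)
g = rotate_p4(f)
```

Here patch `g[1]` is the rotated copy of `f[0]`, and four successive applications return the original feature map.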


For p4m, we can make a similar plot, shown in figure 2. 
A p4m function has 8 planar patches, each one associated 
with a mirroring m and rotation r. Besides red rotation 
arrows, the figure now includes small blue reflection lines 
(which are undirected, since reflections are self-inverse). 


Upon rotation of a p4m function, each patch again follows 
its red r-arrows and undergoes a 90 degree rotation. Un- 
der a mirroring, the patches connected by a blue line will 
change places and undergo the mirroring transformation. 


Figure 2. A p4m feature map and its rotation by r.


This rich transformation structure arises from the group op- 
eration of p4 or p4m, combined with equation 4 which de- 
scribes the transformation of a function on a group. 


Finally, we define the involution of a feature map, which 
will appear in section 6.1 when we study the behavior of 
the G-convolution, and which also appears in the gradient 
of the G-convolution. We have: 


$$f^*(g) = f(g^{-1}) \tag{6}$$


For $\mathbb{Z}^2$ feature maps the involution is just a point reflection,
but for G-feature maps the meaning depends on the
structure of G. In all cases, $f^{**} = f$.


5. Equivariance properties of CNNs 


In this section we recall the definitions of the convolution 
and correlation operations used in conventional CNNs, and 
show that these operations are equivariant to translations 
but not to other transformations such as rotation. This is 
certainly well known and easy to see by mental visualiza- 
tion, but deriving it explicitly will make it easier to follow 
the derivation of group equivariance of the group convolu- 
tion defined in the next section. 


At each layer l, a regular convnet takes as input a stack of
feature maps $f : \mathbb{Z}^2 \to \mathbb{R}^{K^l}$ and convolves or correlates it
with a set of $K^{l+1}$ filters $\psi^i : \mathbb{Z}^2 \to \mathbb{R}^{K^l}$:


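$$\begin{aligned}
[f * \psi^i](x) &= \sum_{y \in \mathbb{Z}^2} \sum_{k=1}^{K^l} f_k(y) \psi^i_k(x - y) \\
[f \star \psi^i](x) &= \sum_{y \in \mathbb{Z}^2} \sum_{k=1}^{K^l} f_k(y) \psi^i_k(y - x)
\end{aligned} \tag{7}$$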


If one employs convolution ($*$) in the forward pass, the correlation
($\star$) will appear in the backward pass when computing
gradients, and vice versa. We will use the correlation in
the forward pass, and refer generically to both operations 
as “convolution”. 


Using the substitution $y \to y + t$, and leaving out the summation
over feature maps for clarity, we see that a translation
followed by a correlation is the same as a correlation
followed by a translation:


$$\begin{aligned}
[[L_t f] \star \psi](x) &= \sum_y f(y - t) \psi(y - x) \\
&= \sum_y f(y) \psi(y + t - x) \\
&= \sum_y f(y) \psi(y - (x - t)) \\
&= [L_t [f \star \psi]](x).
\end{aligned} \tag{8}$$


And so we say that “correlation is an equivariant map for
the translation group”, or that “correlation and translation
commute”. Using an analogous computation one can show
that also for the convolution, $[L_t f] * \psi = L_t [f * \psi]$.


Although convolutions are equivariant to translation, they 
are not equivariant to other isometries of the sampling lat- 
tice. For instance, as shown in the supplementary material, 
rotating the image and then convolving with a fixed filter is 
not the same as first convolving and then rotating the result: 


$$[[L_r f] \star \psi](x) = L_r [f \star [L_{r^{-1}} \psi]](x) \tag{9}$$


In words, this says that the correlation of a rotated image
$L_r f$ with a filter $\psi$ is the same as the rotation by r of the
original image f correlated with the inverse-rotated filter
$L_{r^{-1}} \psi$. Hence, if an ordinary CNN learns rotated copies
of the same filter, the stack of feature maps is equivariant,
although individual feature maps are not.
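

Both statements can be checked numerically. The following NumPy sketch is an illustration, not the implementation used in the experiments: `correlate_valid` is a hypothetical helper that evaluates the correlation only where the filter fits inside the image, and rotations are taken about the array centre with `np.rot90`.

```python
import numpy as np

def correlate_valid(f, psi):
    """[f star psi](x) = sum_y f(y) psi(y - x), where psi fits inside f."""
    n = psi.shape[0]
    m = f.shape[0] - n + 1
    out = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            out[i, j] = np.sum(f[i:i + n, j:j + n] * psi)
    return out

rng = np.random.default_rng(0)
f = rng.standard_normal((8, 8))
psi = rng.standard_normal((3, 3))

# Translation: shifting the input by one row shifts the output by one row
# (compared away from the border, i.e. up to edge effects).
shifted = np.roll(f, 1, axis=0)
translation_ok = np.allclose(correlate_valid(shifted, psi)[1:],
                             correlate_valid(f, psi)[:-1])

# Rotation: the naive relation fails, while the relation of eq. 9 holds.
lhs = correlate_valid(np.rot90(f), psi)
naive_ok = np.allclose(lhs, np.rot90(correlate_valid(f, psi)))
eq9_ok = np.allclose(lhs, np.rot90(correlate_valid(f, np.rot90(psi, -1))))
```

For a generic random filter, `naive_ok` is false while `translation_ok` and `eq9_ok` are true.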


6. Group Equivariant Networks 


In this section we will define the three layers used in a G- 
CNN (G-convolution, G-pooling, nonlinearity) and show 
that each one commutes with G-transformations of the do- 
main of the image. 


6.1. G-Equivariant correlation 


The correlation (eq. 7) is computed by shifting a filter and 
then computing a dot product with the feature maps. By 
replacing the shift by a more general transformation from 
some group G, we get the G-correlation used in the first
layer of a G-CNN:


$$[f \star \psi](g) = \sum_{y \in \mathbb{Z}^2} \sum_k f_k(y) \psi_k(g^{-1} y). \tag{10}$$


Notice that both the input image f and the filter $\psi$ are functions
of the plane $\mathbb{Z}^2$, but the feature map $f \star \psi$ is a function
on the discrete group G (which may contain translations as
a subgroup). Hence, for all layers after the first, the filters $\psi$
must also be functions on G, and the correlation operation
becomes


$$[f \star \psi](g) = \sum_{h \in G} \sum_k f_k(h) \psi_k(g^{-1} h). \tag{11}$$


The equivariance of this operation is derived in complete
analogy to eq. 8, now using the substitution $h \to uh$:


$$\begin{aligned}
[[L_u f] \star \psi](g) &= \sum_{h \in G} \sum_k f_k(u^{-1} h) \psi_k(g^{-1} h) \\
&= \sum_{h \in G} \sum_k f_k(h) \psi_k(g^{-1} u h) \\
&= \sum_{h \in G} \sum_k f_k(h) \psi_k((u^{-1} g)^{-1} h) \\
&= [L_u [f \star \psi]](g)
\end{aligned} \tag{12}$$


The equivariance of eq. 10 is derived similarly. Note that
although equivariance is expressed by the same formula
$[L_u f] \star \psi = L_u [f \star \psi]$ for both first-layer G-correlation
(eq. 10) and full G-correlation (eq. 11), the meaning of the
operator $L_u$ is different: for the first layer correlation, the
inputs f and $\psi$ are functions on $\mathbb{Z}^2$, so $L_u f$ denotes the
transformation of such a function, while $L_u [f \star \psi]$ denotes
the transformation of the feature map, which is a function
on G. For the full G-correlation, both the inputs f and $\psi$
and the output $f \star \psi$ are functions on G.
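

The first-layer case can be verified directly from the definition. The sketch below is an illustration under stated assumptions: G is p4 with elements written as tuples (r, u, v), single-channel functions on the plane are stored as dicts from grid points to integer values (zero elsewhere), and all function names are hypothetical.

```python
def act(g, p):
    """Apply the p4 element g = (r, u, v) to a grid point p."""
    r, u, v = g
    x, y = p
    for _ in range(r % 4):
        x, y = -y, x  # one 90-degree rotation about the origin
    return (x + u, y + v)

def compose(g, h):
    """Element gh, i.e. apply h first and then g."""
    u, v = act(g, (h[1], h[2]))
    return ((g[0] + h[0]) % 4, u, v)

def inverse(g):
    """Inverse of g = (r, u, v): undo the translation, then the rotation."""
    r, u, v = g
    x, y = act(((-r) % 4, 0, 0), (-u, -v))
    return ((-r) % 4, x, y)

def L(u, f):
    """[L_u f](x) = f(u^{-1} x) for a function on the plane."""
    return {act(u, y): value for y, value in f.items()}

def g_correlate(f, psi, g):
    """First-layer G-correlation (eq. 10) evaluated at one group element g."""
    g_inv = inverse(g)
    return sum(value * psi.get(act(g_inv, y), 0) for y, value in f.items())

f = {(0, 0): 1, (1, 0): 2, (1, 2): -3, (-2, 1): 4}
psi = {(0, 0): 2, (1, 0): 1, (0, 1): 3, (-1, -1): 5}
u = (1, 1, -2)

# [[L_u f] star psi](g) should equal [f star psi](u^{-1} g) for every g.
elements = [(r, a, b) for r in range(4) for a in range(-3, 4) for b in range(-3, 4)]
equivariant = all(
    g_correlate(L(u, f), psi, g) == g_correlate(f, psi, compose(inverse(u), g))
    for g in elements
)
```

Integer values keep the comparison exact, so no numerical tolerance is needed.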


Note that if G is not commutative, neither the G-convolution
nor the G-correlation is commutative. However,
the feature maps $\psi \star f$ and $f \star \psi$ are related by the
involution (eq. 6):


$$f \star \psi = (\psi \star f)^*. \tag{13}$$


Since the involution is invertible (it is its own inverse), the 
information content of $f \star \psi$ and $\psi \star f$ is the same. However,
$f \star \psi$ is more efficient to compute when using the method
described in section 7, because transforming a small filter 
is faster than transforming a large feature map. 


It is customary to add a bias term to each feature map in 
a convolution layer. This can be done for G-conv layers 
as well, as long as there is only one bias per G-feature 
map (instead of one bias per spatial feature plane within 
a G-feature map). Similarly, batch normalization (Ioffe & 
Szegedy, 2015) should be implemented with a single scale 
and bias parameter per G-feature map in order to preserve 
equivariance. The sum of two G-equivariant feature maps 
is also G-equivariant, thus G-conv layers can be used in 
highway networks and residual networks (Srivastava et al., 
2015; He et al., 2015). 


6.2. Pointwise nonlinearities 


Equation 12 shows that G-correlation preserves the trans- 
formation properties of the previous layer. What about non- 
linearities and pooling? 


Recall that we think of feature maps as functions on G. In 
this view, applying a nonlinearity $\nu : \mathbb{R} \to \mathbb{R}$ to a feature
map amounts to function composition. We introduce the
composition operator


$$C_\nu f(g) = [\nu \circ f](g) = \nu(f(g)), \tag{14}$$


which acts on functions by post-composing them with $\nu$.


Since the left transformation operator L acts by pre-composition,
C and L commute:


$$C_\nu L_h f = \nu \circ [f \circ h^{-1}] = [\nu \circ f] \circ h^{-1} = L_h C_\nu f, \tag{15}$$


so the rectified feature map inherits the transformation 
properties of the previous layer. 
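

As a quick numerical illustration of eq. 15, the NumPy sketch below checks that a ReLU commutes with the rotation of a p4 feature map. The (4, n, n) array layout and the names `rotate_p4` and `relu` are assumptions made for this example.

```python
import numpy as np

def rotate_p4(f):
    """L_r for a p4 feature map of shape (4, n, n): rotate patches, cycle the rotation axis."""
    return np.roll(np.rot90(f, 1, axes=(-2, -1)), 1, axis=-3)

def relu(f):
    """C_nu for nu = max(0, .), applied pointwise."""
    return np.maximum(f, 0.0)

rng = np.random.default_rng(0)
f = rng.standard_normal((4, 5, 5))

# The transformation only permutes entries, so the order of the two steps does not matter.
commutes = np.array_equal(relu(rotate_p4(f)), rotate_p4(relu(f)))
```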


6.3. Subgroup pooling and coset pooling 


In order to simplify the analysis, we split the pooling operation
into two steps: the pooling itself (performed without
stride), and a subsampling step. The non-strided max-pooling
operation applied to a feature map $f : G \to \mathbb{R}$ can
be modeled as an operator P that acts on f as


$$P f(g) = \max_{k \in gU} f(k), \tag{16}$$


where $gU = \{gu \mid u \in U\}$ is the g-transformation of some
pooling domain $U \subset G$ (typically a neighborhood of the
identity transformation). In a regular convnet, U is usually
a 2 × 2 or 3 × 3 square including the origin (0, 0), and g is
a translation.


As shown in the supplementary material, pooling commutes
with $L_h$:


$$P L_h = L_h P \tag{17}$$


Since pooling tends to reduce the variation in a feature map, 
it makes sense to sub-sample the pooled feature map, or 
equivalently, to do a “pooling with stride”. In a G-CNN, 
the notion of “stride” is generalized by subsampling on a 
subgroup $H \subset G$. That is, H is a subset of G that is itself a
group (i.e. closed under multiplication and inverses). The 
subsampled feature map is then equivariant to H but not G. 


In a standard convnet, pooling with stride 2 is the same as 
pooling and then subsampling on $H = \{(2i, 2j) \mid (i, j) \in \mathbb{Z}^2\}$,
which is a subgroup of $G = \mathbb{Z}^2$. For the p4-CNN, we
may subsample on the subgroup H containing all 4 rota- 
tions, as well as shifts by multiples of 2 pixels. 


We can obtain full G-equivariance by choosing our pooling
region U to be a subgroup $H \subset G$. The pooling domains
gH that result are called cosets in group theory. The cosets
partition the group into non-overlapping regions. The feature
map that results from pooling over cosets is invariant
to the right-action of H, because the cosets are similarly invariant
(ghH = gH). Hence, we can arbitrarily choose one
coset representative per coset to subsample on. The feature
map that results from coset pooling may be thought of as
a function on the quotient space G/H, in which two transformations
are considered equivalent if they are related by
a transformation in H.


As an example, in a p4 feature map, we can pool over all 
four rotations at each spatial position (the cosets of the subgroup
R of rotations around the origin). The resulting feature
map is a function on $\mathbb{Z}^2 \cong p4/R$, i.e. it will transform
in the same way as the input image. Another example is 
given by a feature map on Z, where we could pool over the 
cosets of the subgroup nZ of shifts by multiples of n. This 
gives a feature map on Z/nZ, which has a cyclic transfor- 
mation law under translations. 
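

The p4 example can be sketched with NumPy as follows. This is an illustration only, assuming a p4 feature map stored with shape (4, n, n); the names `rotate_p4` and `pool_rotations` are hypothetical.

```python
import numpy as np

def rotate_p4(f):
    """L_r for a p4 feature map of shape (4, n, n): rotate patches, cycle the rotation axis."""
    return np.roll(np.rot90(f, 1, axes=(-2, -1)), 1, axis=-3)

def pool_rotations(f):
    """Coset max-pooling over the four rotations: (4, n, n) -> (n, n)."""
    return f.max(axis=-3)

rng = np.random.default_rng(0)
f = rng.standard_normal((4, 6, 6))

pooled = pool_rotations(f)
pooled_of_rotated = pool_rotations(rotate_p4(f))
```

The pooled map of the rotated input equals the planar rotation `np.rot90(pooled)`, so the result transforms like an ordinary image.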


This concludes our analysis of G-CNNs. Since all layer 
types are equivariant, we can freely stack them into deep 
networks and expect G-conv parameter sharing to be effec- 
tive at arbitrary depth. 


7. Efficient Implementation 


Computing the G-convolution for a discrete group involves 
nothing more than indexing arithmetic and inner products, 
so it can be implemented straightforwardly using a loop or 
as a parallel GPU kernel. Here we present the details for a 
G-convolution implementation that can leverage recent ad- 
vances in fast computation of planar convolutions (Mathieu 
et al., 2014; Vasilache et al., 2015; Lavin & Gray, 2015). 


A plane symmetry group G is called split if any transformation
$g \in G$ can be decomposed into a translation $t \in \mathbb{Z}^2$
and a transformation s in the stabilizer of the origin (i.e. s
leaves the origin invariant). For the group p4, we can write
g = ts for t a translation and s a rotation about the origin,
while p4m splits into translations and rotation-flips. Using
this split of G and the fact that $L_g L_h = L_{gh}$, we can
rewrite the G-correlation (eq. 10 and 11) as follows:


$$[f \star \psi](ts) = \sum_{h \in X} \sum_k f_k(h) L_t [L_s \psi_k(h)], \tag{18}$$


where $X = \mathbb{Z}^2$ in layer one and X = G in further layers.


Thus, to compute the p4 (or p4m) correlation $f \star \psi$ we can
first compute $L_s \psi$ (“filter transformation”) for all four rotations
(or all eight rotation-flips) and then call a fast planar
correlation routine on f and the augmented filter bank.


The computational cost of the algorithm presented here is 
roughly equal to that of a planar convolution with a filter 
bank that is the same size as the augmented filter bank used 
in the G-convolution, because the cost of the filter transfor- 
mation is negligible. 


7.1. Filter transformation 


The set of filters at layer l is stored in an array F[·] of shape
$K^l \times K^{l-1} \times S^{l-1} \times n \times n$, where $K^l$ is the number of
channels at layer l, $S^{l-1}$ denotes the number of transformations
in $G^l$ that leave the origin invariant (e.g. 1, 4 or 8
for $\mathbb{Z}^2$, p4 or p4m filters, respectively), and n is the spatial
(or translational) extent of the filter. Note that typically,
$S^1 = 1$ for 2D images, while $S^l = 4$ or $S^l = 8$ for l > 1.


The filter transformation $L_s$ amounts to a permutation of
the entries of each of the $K^l \times K^{l-1}$ scalar-valued filter
channels in F. Since we are applying $S^l$ transformations to
each filter, the output of this operation is an array of shape
$K^l \times S^l \times K^{l-1} \times S^{l-1} \times n \times n$, which we call $F^+$.


The permutation can be implemented efficiently by a GPU
kernel that does a lookup into F for each output cell of
$F^+$, using a precomputed index associated with the output
cell. To precompute the indices, we define an invertible
map g(s, u, v) that takes an input index (valid for an array
of shape $S^{l-1} \times n \times n$) and produces the associated group
element g as a matrix (section 4.2 and 4.3). For each input
index (s, u, v) and each transformation s', we compute
$(\bar{s}, \bar{u}, \bar{v}) = g^{-1}(g(s', 0, 0)^{-1} g(s, u, v))$. This index is used
for the lookup into F that fills the corresponding cell of $F^+$.


The G-convolution for a new group can be added by simply
implementing a map g(·) from indices to matrices.
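

For p4 filters, the index precomputation can be sketched in NumPy as below. This is a hedged illustration rather than the GPU kernel described above: it assumes an odd filter size n with the origin at the centre pixel, uses (row, column) offsets as the coordinates (u, v), and all function names are hypothetical.

```python
import numpy as np

COS_SIN = [(1, 0), (0, 1), (-1, 0), (0, -1)]  # (cos, sin) of s quarter turns

def p4_matrix(s, u, v):
    """Index (s, u, v) -> homogeneous matrix of the associated p4 element."""
    c, sn = COS_SIN[s % 4]
    return np.array([[c, -sn, u], [sn, c, v], [0, 0, 1]])

def p4_index(g):
    """Inverse map: matrix -> (s, u, v), with s obtained through arctan2."""
    s = int(np.rint(np.arctan2(g[1, 0], g[0, 0]) / (np.pi / 2))) % 4
    return s, int(g[0, 2]), int(g[1, 2])

def p4_filter_indices(n):
    """idx[s', s, u, v] holds the index into F that fills that cell of F+."""
    c = n // 2  # assumption: n is odd and the origin is the centre pixel
    idx = np.empty((4, 4, n, n, 3), dtype=int)
    for s_out in range(4):
        inv = np.rint(np.linalg.inv(p4_matrix(s_out, 0, 0))).astype(int)
        for s in range(4):
            for u in range(n):
                for v in range(n):
                    sb, ub, vb = p4_index(inv @ p4_matrix(s, u - c, v - c))
                    idx[s_out, s, u, v] = (sb, ub + c, vb + c)
    return idx

def transform_p4_filters(F):
    """F: (K_out, K_in, 4, n, n) -> F+: (K_out, 4, K_in, 4, n, n)."""
    idx = p4_filter_indices(F.shape[-1])
    sb, ub, vb = idx[..., 0], idx[..., 1], idx[..., 2]
    gathered = F[:, :, sb, ub, vb]  # axes: (i, j, s', s, u, v)
    return gathered.transpose(0, 2, 1, 3, 4, 5)
```

With these conventions, the slice of the transformed bank for rotation s' coincides with `np.roll(np.rot90(F, s', axes=(-2, -1)), s', axis=2)`, which gives a convenient cross-check of the index table.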


7.2. Planar convolution 


The second part of the G-convolution algorithm is a planar
convolution using the expanded filter bank $F^+$. If
$S^{l-1} > 1$, the sum over X in eq. 18 involves a sum over
the stabilizer. This sum can be folded into the sum over feature
channels performed by the planar convolution routine
by reshaping $F^+$ from $K^l \times S^l \times K^{l-1} \times S^{l-1} \times n \times n$ to
$S^l K^l \times S^{l-1} K^{l-1} \times n \times n$. The resulting array can be interpreted
as a conventional filter bank with $S^{l-1} K^{l-1}$ planar
input channels and $S^l K^l$ planar output channels, which can
be correlated with the feature maps f (similarly reshaped).
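

Putting the two steps together for p4, a minimal NumPy sketch might look as follows. It is an illustration under stated assumptions, not the authors' implementation: the planar routine is a naive 'valid' correlation instead of a fast library call, the expanded filter bank is built with `np.rot90` and `np.roll` rather than a precomputed index table, rotations are taken about the array centre, and all names are hypothetical.

```python
import numpy as np

def planar_correlate(x, w):
    """x: (C_in, N, N), w: (C_out, C_in, n, n) -> (C_out, M, M), 'valid' mode."""
    n = w.shape[-1]
    m = x.shape[-1] - n + 1
    out = np.zeros((w.shape[0], m, m))
    for a in range(n):
        for b in range(n):
            # contribution of filter tap (a, b), summed over input channels
            out += np.einsum('oc,cij->oij', w[:, :, a, b], x[:, a:a + m, b:b + m])
    return out

def rotate_p4(f):
    """L_r for a p4 feature map of shape (..., 4, n, n)."""
    return np.roll(np.rot90(f, 1, axes=(-2, -1)), 1, axis=-3)

def p4_correlate(f, F):
    """f: (K_in, 4, N, N), F: (K_out, K_in, 4, n, n) -> (K_out, 4, M, M)."""
    k_out, k_in, _, n, _ = F.shape
    # Filter transformation: F+ has shape (K_out, 4, K_in, 4, n, n).
    F_plus = np.stack(
        [np.roll(np.rot90(F, s, axes=(-2, -1)), s, axis=2) for s in range(4)],
        axis=1)
    # Fold the stabilizer axes into the planar channel axes.
    w = F_plus.reshape(k_out * 4, k_in * 4, n, n)
    x = f.reshape(k_in * 4, *f.shape[-2:])
    y = planar_correlate(x, w)
    return y.reshape(k_out, 4, *y.shape[-2:])

rng = np.random.default_rng(0)
f = rng.standard_normal((3, 4, 7, 7))
F = rng.standard_normal((2, 3, 4, 3, 3))

out = p4_correlate(f, F)
equivariant = np.allclose(p4_correlate(rotate_p4(f), F), rotate_p4(out))
```

The final comparison checks that rotating the input p4 feature map rotates the output in the same way, which is the equivariance property of eq. 12 for this layer.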


8. Experiments


8.1. Rotated MNIST


The rotated MNIST dataset (Larochelle et al., 2007) con- 
tains 62000 randomly rotated handwritten digits. The 
dataset is split into training, validation and test sets of
size 10000, 2000 and 50000, respectively. 


We performed model selection using the validation set, 
yielding a CNN architecture (Z2CNN) with 7 layers of 
3 x 3 convolutions (4 x 4 in the final layer), 20 channels 
in each layer, ReLU activation functions, batch normalization,
dropout, and max-pooling after layer 2. For optimization,
we used the Adam algorithm (Kingma & Ba, 2015).
This baseline architecture outperforms the models tested
by Larochelle et al. (2007) (when trained on 12k and evaluated
on 50k), but does not match the previous state of the
art, which uses prior knowledge about rotations (Schmidt
& Roth, 2012) (see table 1).


Next, we replaced each convolution by a p4-convolution 
(eq. 10 and 11), divided the number of filters by $\sqrt{4} = 2$
(so as to keep the number of parameters approximately
fixed), and added max-pooling over rotations after the last
convolution layer. This architecture (P4CNN) was found
to perform better without dropout, so we removed it. The 
P4CNN almost halves the error rate of the previous state of 
the art (2.28% vs 3.98% error). 


We then tested the hypothesis that premature invariance is 
undesirable in a deep architecture (section 2). We took 
the Z2CNN and replaced each convolution layer by a p4-convolution
(eq. 10) followed by a coset max-pooling over
rotations. The resulting feature maps consist of rotation- 
invariant features, and have the same transformation law as 
the input image. This network (P4CNNRotationPooling) 
outperforms the baseline and the previous state of the art, 
but performs significantly worse than the P4CNN which 
does not pool over rotations in intermediate layers. 


Network                     Test Error (%)
Larochelle et al. (2007)    10.38 ± 0.27
Sohn & Lee (2012)           4.2
Schmidt & Roth (2012)       3.98
Z2CNN                       5.03 ± 0.0020
P4CNNRotationPooling        3.21 ± 0.0012
P4CNN                       2.28 ± 0.0004


Table 1. Error rates on rotated MNIST (with standard deviation 
under variation of the random seed). 


8.2. CIFAR-10 


The CIFAR-10 dataset consists of 60k images of size 32 × 32,
divided into 10 classes. The dataset is split into 40k
training, 10k validation and 10k testing splits. 


We compared the p4-, p4m- and standard planar Z^2 
convolutions on two kinds of baseline architectures. Our first 
baseline is the All-CNN-C architecture by Springenberg 
et al. (2015), which consists of a sequence of 9 strided and 
non-strided convolution layers, interspersed with rectified 
linear activation units, and nothing else. Our second 
baseline is a residual network (He et al., 2016), which consists 
of an initial convolution layer, followed by three stages of 
2n convolution layers using k_i filters at stage i, followed 
by a final classification layer (6n + 2 layers in total). The 
first convolution in each stage i > 1 uses a stride of 2, so 
the feature map sizes are 32, 16, and 8 for the three stages. 
We use n = 7, k_i = 32, 64, 128 yielding a wide 44-layer 
network called ResNet44. 
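
The layer bookkeeping above can be written out as a short sketch. The function name resnet_layout and its dictionary output are hypothetical; the code only restates the arithmetic in the text (three stages of 2n convolution layers, plus the initial convolution and the final classification layer).

```python
def resnet_layout(n, widths, input_size=32):
    """Return total depth and per-stage settings for the ResNet described above."""
    depth = 6 * n + 2  # 3 stages * 2n layers + initial conv + classifier
    stages = []
    size = input_size
    for i, k in enumerate(widths, start=1):
        if i > 1:
            size //= 2  # the first convolution of stages 2 and 3 uses stride 2
        stages.append({"stage": i, "conv_layers": 2 * n,
                       "filters": k, "feature_map_size": size})
    return depth, stages


depth, stages = resnet_layout(n=7, widths=(32, 64, 128))
print(depth)                                    # 44
print([s["feature_map_size"] for s in stages])  # [32, 16, 8]
```

Under the same 6n + 2 formula, a 26-layer network corresponds to n = 4.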


To evaluate G-CNNs, we replaced all convolution layers of 
the baseline architectures by p4 or p4m convolutions. For a 
constant number of filters, this increases the size of the 
feature maps 4 or 8-fold, which in turn increases the number of 
parameters required per filter in the next layer. Hence, we 
halve the number of filters in each p4-conv layer, and divide 
it by roughly √8 ≈ 3 in each p4m-conv layer. This way, 
the number of parameters is left approximately invariant, 
while the size of the internal representation is increased. 
Specifically, we used k_i = 11, 23, 45 for p4m-ResNet44. 
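
The filter-count scaling can be checked with rough arithmetic. The sketch below rests on stated assumptions: a planar layer with k_in input and k_out output channels and n × n filters has k_in * k_out * n^2 weights, while a G-convolution layer whose input is itself a G-feature map has k_in * k_out * s * n^2 weights, where s is the number of rotations and reflections (4 for p4, 8 for p4m). Biases are ignored, 3 × 3 filters are assumed, and rounding to the nearest integer is an assumption that happens to reproduce the widths quoted above. All function names are hypothetical.

```python
import math


def planar_params(k_in, k_out, n=3):
    """Weights in a planar convolution layer (biases ignored)."""
    return k_in * k_out * n * n


def gconv_params(k_in, k_out, s, n=3):
    """Weights in a G-convolution layer whose input has s planes per channel."""
    return k_in * k_out * s * n * n


def scaled_widths(widths, s):
    """Divide each filter count by sqrt(s) and round to the nearest integer."""
    return [round(k / math.sqrt(s)) for k in widths]


baseline = (32, 64, 128)
print(scaled_widths(baseline, 4))  # [16, 32, 64]
print(scaled_widths(baseline, 8))  # [11, 23, 45]

# Per-layer weight counts stay close to the planar baseline.
print(gconv_params(32, 32, 4) == planar_params(64, 64))  # True
print(gconv_params(23, 23, 8) / planar_params(64, 64))   # close to 1
```

Dividing the width by √s cancels the factor s in the weight count because both k_in and k_out shrink, which is why the divisor is √4 for p4 and √8 for p4m rather than 4 or 8.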


To evaluate the impact of data augmentation, we compare 
the networks on CIFAR10 and augmented CIFAR10+. The 
latter denotes moderate data augmentation with horizon- 
tal flips and small translations, following Goodfellow et al. 
(2013) and many others. 


The training procedure for the All-CNN was reproduced 
as closely as possible from Springenberg et al. 
(2015). For the ResNets, we used stochastic gradient 
descent with an initial learning rate of 0.05 and momentum 0.9. 
The learning rate was divided by 10 at epochs 50, 100 and 
150, and training was continued for 300 epochs. 
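
The ResNet learning rate schedule can be sketched as a step function. This is an illustration under assumptions the text does not settle: epochs are counted from zero and each drop takes effect at the start of the milestone epoch. The function name learning_rate is hypothetical.

```python
def learning_rate(epoch, base_lr=0.05, milestones=(50, 100, 150), factor=10.0):
    """Step schedule: divide base_lr by `factor` once per milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / factor ** drops


# Print the learning rate at the start of training, at each drop, and at the end.
for epoch in (0, 49, 50, 100, 150, 299):
    print(epoch, learning_rate(epoch))
```

After epoch 150 the rate stays at one thousandth of its initial value for the remaining epochs.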


Network     G      CIFAR10   CIFAR10+   Param. 
All-CNN     Z^2    9.44      8.86       1.37M 
            p4     8.84      7.67       1.37M 
            p4m    7.59      7.04       1.22M 
ResNet44    Z^2    9.45      5.61       2.64M 
            p4m    6.46      4.94       2.62M 


Table 2. Comparison of conventional (i.e. Z^2), p4 and p4m CNNs 
on CIFAR10 and augmented CIFAR10+. Test set error rates and 
number of parameters are reported. 


To the best of our knowledge, the p4m-CNN outperforms 
all published results on plain CIFAR10 (Wan et al., 2013; 
Goodfellow et al., 2013; Lin et al., 2014; Lee et al., 2015b; 
Srivastava et al., 2015; Clevert et al., 2015; Lee et al., 
2015a). However, due to radical differences in model sizes 
and architectures, it is difficult to infer much about the in- 
trinsic merit of the various techniques. It is quite possi- 
ble that the cited methods would yield better results when 
deployed in larger networks or in combination with other 
techniques. Extreme data augmentation and model ensem- 
bles can also further improve the numbers (Graham, 2014). 


Inspired by the wide ResNets of Zagoruyko & Komodakis 
(2016), we trained another ResNet with 26 layers and 
k_i = (71, 142, 248) (for planar convolutions) or k_i = 
(50, 100, 150) (for p4m convolutions). When trained with 
moderate data augmentation, this network achieves an 
error rate of 5.27% using planar convolutions, and 4.19% 
with p4m convolutions. This result is comparable to the 
4.17% error reported by Zagoruyko & Komodakis (2016), 
but using fewer parameters (7.2M vs 36.5M). 


9. Discussion & Future work 


Our results show that p4 and p4m convolution layers can 
be used as a drop-in replacement of standard convolutions 
that consistently improves the results. 


G-CNNs benefit from data augmentation in the same way 
as convolutional networks, as long as the augmentation 
comes from a group larger than G. Augmenting with flips 
and small translations consistently improves the results for 
the p4 and p4m-CNN. 


The CIFAR dataset is not actually symmetric, since objects 
typically appear upright. Nevertheless, we see substantial 
increases in accuracy on this dataset, indicating that there 
need not be a full symmetry for G-convolutions to be ben- 
eficial. 


In future work, we want to implement G-CNNs that work 
on hexagonal lattices which have an increased number of 
symmetries relative to square grids, as well as G-CNNs for 
3D space groups. All of the theory presented in this paper is 
directly applicable to these groups, and the G-convolution 
can be implemented in such a way that new groups can 
be added by simply specifying the group operation and a 
bijective map between the group and the set of indices. 
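
As an illustration of what such a specification could look like, the sketch below describes the group p4 by its group operation and an index map. It is a hypothetical interface, not the implementation used for the experiments. It assumes that an element of p4 is written as (r, u, v), with r in {0, 1, 2, 3} counting 90-degree rotations and (u, v) an integer translation, and that the index map is restricted to a finite window of translations given by the assumed parameter radius.

```python
def rotate(r, u, v):
    """Apply r counterclockwise 90-degree rotations to the vector (u, v)."""
    for _ in range(r % 4):
        u, v = -v, u
    return u, v


def compose(g, h):
    """Group operation of p4: first apply h, then g."""
    r1, u1, v1 = g
    r2, u2, v2 = h
    ru, rv = rotate(r1, u2, v2)
    return ((r1 + r2) % 4, ru + u1, rv + v1)


def inverse(g):
    """Inverse element, so that compose(g, inverse(g)) is the identity."""
    r, u, v = g
    iu, iv = rotate(-r, -u, -v)
    return ((-r) % 4, iu, iv)


def to_index(g, radius):
    """Map a group element in the window |u|, |v| <= radius to array indices."""
    r, u, v = g
    return (r, u + radius, v + radius)


def from_index(idx, radius):
    """Inverse of to_index on the same window."""
    r, i, j = idx
    return (r, i - radius, j - radius)


g, h = (1, 2, -3), (3, -1, 4)
print(compose(g, h))
print(compose(g, inverse(g)))              # (0, 0, 0)
print(from_index(to_index(g, 8), 8) == g)  # True
```

With only these functions, the indices of a transformed filter or feature map can be computed by composing group elements and looking up the result, which is the sense in which a new group needs nothing beyond its operation and index map.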


One limitation of the method as presented here is that it 
only works for discrete groups. Convolution on continuous 
(locally compact) groups is mathematically well-defined, 
but may be hard to approximate in an equivariant manner. 
A further challenge, already identified by Gens & Domin- 
gos (2014), is that a full enumeration of transformations in 
a group may not be feasible if the group is large. 


Finally, we hope that the current work can serve as a con- 
crete example of the general philosophy of “structured rep- 
resentations”, outlined in section 2. We believe that adding 
mathematical structure to a representation (making sure 
that maps between representations preserve this structure), 
could enhance the ability of neural nets to see abstract sim- 
ilarities between superficially different concepts. 


10. Conclusion 


We have introduced G-CNNs, a generalization of convolu- 
tional networks that substantially increases the expressive 
capacity of a network without increasing the number of 
parameters. By exploiting symmetries, G-CNNs achieve 
state of the art results on rotated MNIST and CIFAR10. 
We have developed the general theory of G-CNNs for dis- 
crete groups, showing that all layer types are equivariant to 
the action of the chosen group G. Our experimental results 
show that G-convolutions can be used as a drop-in replace- 
ment for spatial convolutions in modern network architec- 
tures, improving their performance without further tuning. 


Acknowledgements 


We would like to thank Joan Bruna, Sander Dieleman, 
Robert Gens, Chris Olah, and Stefano Soatto for helpful 
discussions. This research was supported by NWO (grant 
number NAI.14.108), Google and Facebook. 


References 


Agrawal, P., Carreira, J., and Malik, J. Learning to See 
by Moving. In International Conference on Computer 
Vision (ICCV), 2015. 


Anselmi, F., Leibo, J. Z., Rosasco, L., Mutch, J., Tacchetti, 
A., and Poggio, T. Unsupervised learning of invariant 
representations with low sample complexity: the magic 
of sensory cortex or a new framework for machine learn- 
ing? Technical Report 001, MIT Center for Brains, 
Minds and Machines, 2014. 


Anselmi, F., Rosasco, L., and Poggio, T. On Invariance and 
Selectivity in Representation Learning. Technical report, 
MIT Center for Brains, Minds and Machines, 2015. 


Bruna, J. and Mallat, S. Invariant scattering convolu- 
tion networks. IEEE Transactions on Pattern Analysis 
and Machine Intelligence (TPAMI), 35(8):1872-86, aug 
2013. 


Clevert, D., Unterthiner, T., and Hochreiter, S. Fast and 
Accurate Deep Network Learning by Exponential Linear 
Units (ELUs). arXiv:1511.07289v3, 2015. 


Cohen, T. and Welling, M. Learning the Irreducible Repre- 
sentations of Commutative Lie Groups. In Proceedings 
of the 31st International Conference on Machine Learn- 
ing (ICML), volume 31, pp. 1755-1763, 2014. 


Cohen, T. S. and Welling, M. Transformation Properties of 
Learned Visual Representations. International Confer- 
ence on Learning Representations (ICLR), 2015. 


Dieleman, S., Willett, K. W., and Dambre, J. Rotation- 
invariant convolutional neural networks for galaxy mor- 
phology prediction. Monthly Notices of the Royal Astro- 
nomical Society, 450(2), 2015. 


Dieleman, S., De Fauw, J., and Kavukcuoglu, K. Ex- 
ploiting Cyclic Symmetry in Convolutional Neural Net- 
works. In International Conference on Machine Learn- 
ing (ICML), 2016. 


Gens, R. and Domingos, P. Deep Symmetry Networks. 
In Advances in Neural Information Processing Systems 
(NIPS), 2014. 


Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, 
A., and Bengio, Y. Maxout Networks. In Proceedings of 
the 30th International Conference on Machine Learning 
(ICML), pp. 1319-1327, 2013. 


Graham, B. Fractional Max-Pooling. arXiv:1412.6071, 
2014. 


He, K., Zhang, X., Ren, S., and Sun, J. Deep Resid- 
ual Learning for Image Recognition. arXiv:1512.03385, 
2015. 


He, K., Zhang, X., Ren, S., and Sun, J. Identity Mappings 
in Deep Residual Networks. arXiv:1603.05027, 2016. 


Hinton, G. E., Krizhevsky, A., and Wang, S. D. Trans- 
forming auto-encoders. ICANN-11: International Con- 
ference on Artificial Neural Networks, Helsinki, 2011. 


Ioffe, S. and Szegedy, C. Batch Normalization : Acceler- 
ating Deep Network Training by Reducing Internal Co- 
variate Shift. arXiv:1502.03167v3, 2015. 


Jaderberg, M., Simonyan, K., Zisserman, A., and 
Kavukcuoglu, K. Spatial Transformer Networks. In 
Advances in Neural Information Processing Systems 28 
(NIPS 2015), 2015. 


Kingma, D. and Ba, J. Adam: A Method for Stochastic 
Optimization. In Proceedings of the International Con- 
ference on Learning Representations (ICLR), 2015. 


Kivinen, J. J. and Williams, C. K. I. Transformation 
equivariant Boltzmann machines. In 21st International 
Conference on Artificial Neural Networks, jun 2011. 


Kondor, R. A novel set of rotationally and translation- 
ally invariant features for images based on the non- 
commutative bispectrum. arXiv:0701127, 2007. 


Krizhevsky, A., Sutskever, I., and Hinton, G. ImageNet 
classification with deep convolutional neural networks. 
Advances in Neural Information Processing Systems, 25, 
2012. 


Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and 
Bengio, Y. An empirical evaluation of deep architectures 
on problems with many factors of variation. Proceedings 
of the 24th International Conference on Machine Learn- 
ing (ICML), 2007. 


Lavin, A. and Gray, S. Fast Algorithms for Convolutional 
Neural Networks. arXiv:1509.09308, 2015. 


LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. 
Nature, 521(7553):436-444, 2015. 


Lee, C., Gallagher, P. W., and Tu, Z. Generalizing Pooling 
Functions in Convolutional Neural Networks: Mixed, 
Gated, and Tree. arXiv:1509.08985, 2015a. 


Lee, C., Xie, S., Gallagher, P.W., Zhang, Z., and Tu, 
Z. Deeply-Supervised Nets. In Proceedings of the 
Eighteenth International Conference on Artificial Intel- 
ligence and Statistics (AISTATS), volume 38, pp. 562- 
570, 2015b. 


Lenc, K. and Vedaldi, A. Understanding image represen- 
tations by measuring their equivariance and equivalence. 
In Proceedings of the IEEE Conf. on Computer Vision 
and Pattern Recognition (CVPR), 2015. 


Lin, M., Chen, Q., and Yan, S. Network In Network. 
International Conference on Learning Representations 
(ICLR), 2014. 


Lowe, D. G. Distinctive Image Features from Scale-Invariant 
Keypoints. International Journal of Computer 
Vision, 60(2):91-110, nov 2004. 


Manay, S., Cremers, D., Hong, B. W., Yezzi, A. J., and 
Soatto, S. Integral invariants for shape matching. 
IEEE Transactions on Pattern Analysis and Machine 
Intelligence, 28(10):1602-1617, 2006. ISSN 01628828. 
doi: 10.1109/TPAMI.2006.208. 


Mathieu, M., Henaff, M., and LeCun, Y. Fast Training of 
Convolutional Networks through FFTs. In International 
Conference on Learning Representations (ICLR), 2014. 


Oyallon, E. and Mallat, S. Deep Roto-Translation 
Scattering for Object Classification. In IEEE Conference on 
Computer Vision and Pattern Recognition (CVPR), pp. 
2865-2873, 2015. 


Reisert, M. Group Integration Techniques in Pattern 
Analysis. PhD thesis, Albert-Ludwigs-University, 2008. 


Schmidt, U. and Roth, S. Learning rotation-aware fea- 
tures: From invariant priors to equivariant descriptors. 
Proceedings of the IEEE Computer Society Conference 
on Computer Vision and Pattern Recognition (CVPR), 
2012. 


Sifre, L. and Mallat, S. Rotation, Scaling and 
Deformation Invariant Scattering for Texture 
Discrimination. IEEE Conference on Computer Vision and 
Pattern Recognition (CVPR), 2013. 


Skibbe, H. Spherical Tensor Algebra for Biomedical 
Image Analysis. PhD thesis, Albert-Ludwigs-Universität 
Freiburg im Breisgau, 2013. 


Sohn, K. and Lee, H. Learning Invariant Representations 
with Local Transformations. Proceedings of the 29th 
International Conference on Machine Learning 
(ICML-12), 2012. 


Springenberg, J.T., Dosovitskiy, A., Brox, T., and Ried- 
miller, M. Striving for Simplicity: The All Convolu- 
tional Net. Proceedings of the International Conference 
on Learning Representations (ICLR), 2015. 


Srivastava, R. K., Greff, K., and Schmidhuber, J. 
Training Very Deep Networks. Advances in 
Neural Information Processing Systems (NIPS), 2015. 


Vasilache, N., Johnson, J., Mathieu, M., Chintala, S., Pi- 
antino, S., and LeCun, Y. Fast convolutional nets with 
fbfft: A GPU performance evaluation. In International 
Conference on Learning Representations (ICLR), 2015. 


Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, 
R. Regularization of neural networks using dropconnect. 
International Conference on Machine Learning (ICML), 
pp. 109-111, 2013. 


Zagoruyko, S. and Komodakis, N. Wide Residual Net- 
works. arXiv:1605.07146, 2016. 


Zhang, C., Voinea, S., Evangelopoulos, G., Rosasco, L., 
and Poggio, T. Discriminative template learning in 
group-convolutional networks for invariant speech rep- 
resentations. InterSpeech, pp. 3229-3233, 2015. 


